De Novo Peptide Sequencing In Real Time

Goal: Protein Identification from the Mass and Intensity readings of the ions from the Tandem Mass Spectrometer

But before we go into details regarding the method let us first understand what we mean by Tandem Mass Spectroscopy.

Tandem Mass Spectroscopy

What is Mass Spectroscopy?

It is an analytical chemistry technique that helps identify the amount and type of chemicals present in a sample by measuring the mass-to-charge ratio of ions.


So, are Tandem Mass Spectroscopy and Mass Spectroscopy the same?

No they are not. Tandem Mass Spectroscopy uses a tandem mass spectrometer. A Tandem Mass Spectrometer can be thought of as two mass spectrometers connected in series by a chamber that can break molecules into pieces.

When do we want to use Tandem Mass Spectroscopy?

Most of the time it is impractical to identify every compound in a sample, so it is better to target the specific compounds that are of interest to our research. In a Tandem Mass Spectrometer, a sample is first sorted and weighed in the first Mass Spectrometer, then broken into multiple pieces in the collision cell, and finally one or more of those pieces are sorted and weighed in the second Mass Spectrometer.

Method We Followed

We took the following steps to identify the proteins:

Parsing the possible Amino Acid Strings from the FASTA file
Converting these Amino Acid Strings into possible Tryptic Peptides
Parsing the Mass Spectrometer data from the MGF file
Calculating the mass and charge of the spectrum of that particular scan
Filtering down the possible peptides using the mass of the spectrum
Selection of peptide using the scoring functions
Validation

Parsing the protein data

The first thing that made sense to do was to read in the database of available proteins (amino acid chains) and store them so that we could later form tryptic peptides from them. For that we wrote a small parser function that reads the file and extracts all the proteins from it.

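    def readFasta(filename):
        """Return a dict mapping each FASTA header line to its sequence."""
        proteins = {}
        name = ""
        dna = ""  # holds the amino acid sequence of the current entry
        with open(filename, 'r') as inf:
            for line in inf:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                if line[0] == '>':
                    # A new header: store the entry collected so far
                    if name != "":
                        proteins[name] = dna
                    name = line[1:]
                    dna = ""
                else:
                    dna += line
        # Store the last entry in the file
        if name != "":
            proteins[name] = dna
        return proteins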

This gave us a dictionary of all the proteins in the file, keyed by the FASTA header line that carries the accession, so each sequence can be looked up easily.

FASTA : A text-based format for representing either nucleotide sequences or peptide sequences

Converting the Proteins into Tryptic Peptides

What is a Tryptic Peptide?

A tryptic peptide is a peptide produced by digestion with trypsin, which cleaves at sites matching [KR]|[^P], that is, after a K or R that is not followed by P.

Peptide : A short chain of amino acids linked by amide bonds

Conversion

To convert the amino acid sequences into tryptic peptides, we wrote a regular-expression-based function that splits the sequences at the trypsin digestion sites.

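    import re

    def sep_peptides(peptide):
        """Split an amino acid sequence at the trypsin digestion sites."""
        peptides = []
        # Trypsin cleaves after R or K unless the next residue is P
        split_matches = re.finditer(r'[RK](?!P)', peptide)
        prev_split_location = 0
        for m in split_matches:
            split_location = m.end()
            peptides.append(peptide[prev_split_location:split_location])
            prev_split_location = split_location
        # Keep whatever remains after the last digestion site
        if prev_split_location < len(peptide):
            peptides.append(peptide[prev_split_location:])
        return peptides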

Both of the first two steps can be run with the following line of code, using either ups.fasta (the small database) or UP000005640_9606.fasta (the big database); the output is saved in a regular .txt file.



    run read_fasta.py ups.fasta

Parsing the Mass Spectrometer Data

To read and store the data for future purposes we wrote a parser function that stored the mass-intensity pairs of the spectrum along with the metadata like the mass to charge ratio, the scan number and the charge of the ion.

Mascot Generic Format (MGF) is a format in which each Mass Spectrometer reading is stored as a list of mass and intensity pairs.

    def read_mgf(fp):
        """Return a list of (metadata, spectrum) tuples, one per scan."""
        metadata = {}
        spectrum = []
        spectrums = []
        with open(fp, 'r') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                if line == "BEGIN IONS":
                    # Start collecting a new scan
                    spectrum = []
                    metadata = {}
                elif line == "END IONS":
                    spectrums.append((metadata, spectrum))
                elif line.count('=') == 1:
                    label, value = line.split('=')
                    if label == 'CHARGE':
                        metadata[label] = value  # kept as text
                    elif '.' in value:
                        metadata[label] = float(value)
                    else:
                        metadata[label] = int(value)
                elif line[0].isdigit():
                    # A peak line: mass and intensity
                    mass, intensity = line.split()
                    spectrum.append((float(mass), float(intensity)))
        # Return every scan, not just the last spectrum
        return spectrums

Calculating Mass and Charge

Because the relevant values were already stored as metadata while parsing the file, this was a fairly easy step: it only required reading the right numbers from the metadata.

Charge was already stored in the file and hence was accessible from the metadata

    def calc_charge(metadata):
        # CHARGE is stored as text; its first character is the charge digit
        return int(metadata['CHARGE'][0])

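Mass, however, was not stored directly in the file, so it had to be derived from the PEPMASS value.

The formula used was (m + 18.01)/z + 1.007 = PEPMASS (stored in the MGF file), where m is the mass we want and z is the charge. The constants 18.01 and 1.007 approximate the masses of water and a proton.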

    def calc_mass(metadata, charge):
        pep_mass = metadata['PEPMASS']
        # Invert (m + 18.01)/z + 1.007 = PEPMASS to recover m
        mass = (pep_mass - 1.007) * charge - 18.01
        return mass
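As a quick sanity check of the formula, the sketch below inverts it and then substitutes the result back in. The PEPMASS of 500.0 and the charge of 2 are made-up values, not taken from our data, and the function and constant names are hypothetical.

```python
PROTON_MASS = 1.007  # constant used in the formula above
WATER_MASS = 18.01   # constant used in the formula above

def mass_from_pepmass(pep_mass, charge):
    """Invert (m + 18.01)/z + 1.007 = PEPMASS to recover m."""
    return (pep_mass - PROTON_MASS) * charge - WATER_MASS

def pepmass_from_mass(mass, charge):
    """Forward direction of the same formula."""
    return (mass + WATER_MASS) / charge + PROTON_MASS

# Illustrative values only
m = mass_from_pepmass(500.0, 2)
print(round(m, 3))                        # (500.0 - 1.007) * 2 - 18.01
print(round(pepmass_from_mass(m, 2), 3))  # should give back PEPMASS
```

Note that the charge multiplies the bracket before water is subtracted, so a wrong charge shifts the mass by hundreds of Daltons rather than by a small error.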

Filtering Using Mass

The basic idea was that the mass of the peptide should match or be close to the mass of the spectrum. The accepted error was +/- 0.5 Daltons. The steps that we followed to achieve filtering by mass were as follows :

Calculate the mass of the peptide
Filter out the peptides that are not within 0.5 of the mass of the spectrum

Dalton : 1/12 of the mass of a Carbon-12 atom

Mass of the Peptide

To calculate the mass of the peptide we used the Amino Acid Mass Table. The only change we made to the table was to add 57.0219 to the mass of Cysteine, to account for the modification it carries by the time it is measured.
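A minimal sketch of this calculation is shown below. The table holds standard monoisotopic residue masses, and the cysteine entry already includes the 57.0219 adjustment. The names `RESIDUE_MASS` and `peptide_mass` are assumptions, as is the choice to sum residue masses without adding water, which appears to be what the mass formula in the previous section expects.

```python
# Standard monoisotopic residue masses in Daltons
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'I': 113.08406,
    'N': 114.04293, 'D': 115.02694, 'Q': 128.05858, 'K': 128.09496,
    'E': 129.04259, 'M': 131.04049, 'H': 137.05891, 'F': 147.06841,
    'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
    # Cysteine: 103.00919 plus the 57.0219 adjustment from the text
    'C': 103.00919 + 57.0219,
}

def peptide_mass(peptide):
    """Sum the residue masses; return None if a letter is not in the table."""
    total = 0.0
    for aa in peptide:
        if aa not in RESIDUE_MASS:
            return None  # e.g. an ambiguous residue such as 'X'
        total += RESIDUE_MASS[aa]
    return total

print(peptide_mass("GAK"))  # 57.02146 + 71.03711 + 128.09496
```

Returning None for unknown letters lets the caller skip such peptides instead of failing partway through a large database.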

Filtering Out

We just needed to get the list of all possible candidates which we achieved using:

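    def find_candidates(peptides, mass):
        """peptides maps each peptide string to its mass."""
        candidates = []
        for peptide in peptides:
            # Keep peptides within 0.5 Daltons of the spectrum mass
            if abs(peptides[peptide] - mass) < 0.5:
                candidates.append(peptide)
        return candidates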

Selection Using Scoring Function

Y-ion mass = Amino Acid Mass + 19.018, where Amino Acid Mass is the summed mass of the amino acids in the suffix
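To make the y-ion formula concrete, the sketch below lists the y-ion masses of a short peptide, one per suffix. It assumes every suffix is used, including the full peptide. The function name `y_ion_masses` is hypothetical, and the small mass table covers only the residues in the example.

```python
Y_ION_OFFSET = 19.018  # offset from the formula above

def y_ion_masses(peptide, residue_mass):
    """Return the y-ion masses of a peptide, shortest suffix first."""
    masses = []
    suffix_mass = 0.0
    # Walk from the C-terminus so each step extends the suffix by one residue
    for aa in reversed(peptide):
        suffix_mass += residue_mass[aa]
        masses.append(suffix_mass + Y_ION_OFFSET)
    return masses

# Monoisotopic residue masses for the three residues used here
demo_masses = {'G': 57.02146, 'A': 71.03711, 'K': 128.09496}
print([round(m, 3) for m in y_ion_masses("GAK", demo_masses)])
```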

A Simple Scoring Function :

For “suffix mass” m, f(m)=1 if corresponding y-ion matched a peak; f(m)=0 if not.
score(P) = sum of all f(m)
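The simple scoring function can be sketched as below, assuming the y-ion masses of the candidate have already been computed and that a y-ion counts as matched when some peak lies within a fixed tolerance of it. The 0.5 tolerance, the name `simple_score`, and the sample peaks are assumptions for illustration.

```python
def simple_score(y_masses, spectrum, tol=0.5):
    """Count the y-ions that have at least one peak within tol."""
    score = 0
    for y in y_masses:
        # f(m) = 1 if any peak matches this y-ion, else 0
        if any(abs(peak_mass - y) <= tol for peak_mass, _ in spectrum):
            score += 1
    return score

# Made-up spectrum of (mass, intensity) pairs
spectrum = [(147.1, 50.0), (218.2, 200.0), (300.0, 10.0)]
y_masses = [147.113, 218.150, 275.172]
print(simple_score(y_masses, spectrum))  # two of the three y-ions match
```

Because every match counts the same, a weak noise peak contributes as much as a strong fragment peak.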

A Better Scoring Function :

f(m) = log (1 + 100 * relative intensity of matched y-ion)
Relative intensity is ratio between current peak and the highest peak in the spectrum.
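A sketch of the intensity-weighted score follows. It assumes the natural logarithm, a 0.5 matching tolerance, and that the strongest peak is used when several peaks fall within tolerance of the same y-ion; none of these choices is specified above, and the name `better_score` is hypothetical.

```python
import math

def better_score(y_masses, spectrum, tol=0.5):
    """Sum log(1 + 100 * relative intensity) over the matched y-ions."""
    if not spectrum:
        return 0.0
    max_intensity = max(intensity for _, intensity in spectrum)
    if max_intensity <= 0:
        return 0.0  # no usable intensities to normalise against
    score = 0.0
    for y in y_masses:
        # Intensities of all peaks within tol of this y-ion
        matched = [i for m, i in spectrum if abs(m - y) <= tol]
        if matched:
            relative = max(matched) / max_intensity
            score += math.log(1 + 100 * relative)
    return score

# Made-up spectrum of (mass, intensity) pairs
spectrum = [(147.1, 50.0), (218.2, 200.0), (300.0, 10.0)]
y_masses = [147.113, 218.150, 275.172]
print(better_score(y_masses, spectrum))  # log(26) + log(101)
```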

All the results can be generated by running the following command:

> python read_mgf.py <fasta_file> <mgf_file>

Validation

We validated the results stored in the candidates*.txt file with a Python script that compared, for a given scan number, our result with the result of the commercial software.

The command was as follows:

python diff.py candidates*.txt

Results and Improvements

Results with Small Database

Scans Commercial Software found a result for : 282

Correct Scans With Simple Scoring Function and only Y-ions : 136

Correct Scans With Simple Scoring Function and both Y-ions and B-ions : 136

Correct Scans With Better Scoring Function and only Y-ions : 177

Correct Scans With Better Scoring Function and both Y-ions and B-ions : 178

Results with Big Database

Correct Scans with Better Scoring Function and only Y-ions : 125

Correct Scans with Better Scoring Function and both Y-ions and B-ions : 141

Our scoring function is possibly not the most effective, and it appears to have trouble with a large database: the number of correct scans drops when the search space grows.

Future Improvements

Try a random "decoy" protein sequence database
Normalize the peptide spectrum match score
Search for non tryptic peptides
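For the first item, one common way to build a decoy database is to shuffle each real sequence, which keeps its length and amino acid composition but scrambles the peptides. The sketch below is only an illustration of that idea, not something we ran; the name `make_decoys`, the "DECOY_" prefix, and the sample protein are assumptions.

```python
import random

def make_decoys(proteins, seed=0):
    """Return shuffled decoy sequences keyed by 'DECOY_' + original name."""
    rng = random.Random(seed)  # fixed seed so the decoys are reproducible
    decoys = {}
    for name, seq in proteins.items():
        residues = list(seq)
        rng.shuffle(residues)
        decoys["DECOY_" + name] = "".join(residues)
    return decoys

# Made-up protein entry in the same shape as the FASTA dictionary
proteins = {"P1": "MKTAYIAKQR"}
decoys = make_decoys(proteins)
print(decoys)
```

Searching the real and decoy sequences together would give a rough estimate of how often a high score arises by chance.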


